Method and apparatus for web information extraction service



(57) A method provides a Web information extraction
service in an intelligent agent system having a client
module and a server module. First, a Web browser
in the client module is driven and connected to the
server module via a network. At the client module, a Java
applet is downloaded from the server module to form a
user interface window. Next, a target uniform resource
locator (URL), a keyword and preset depth first search
information are inputted and then sent to the
server module. Web sites corresponding to the target
URL and URL's of lower layers linked thereto are
searched based on the depth first search information.
Web information corresponding to the keyword is
extracted from Web pages of the searched sites.
Thereafter, the extracted information is processed and stored
as a single user file. Finally, the stored information is
transferred to the client module when all lower layer
URL's corresponding to the depth first search
information are processed, to enable the user to browse the
information using the Web browser.
Description

[0001] The present invention relates to an intelli- 
gent Java agent system; and, more particularly, to a 
method and apparatus for Web information extraction 
service by using Java language. 
[0002] The Web has become one of the most widely
used information service systems worldwide. It is a
worldwide Internet connection of individual information
resources in diverse areas and for various purposes.
Internet users can access Web sites to
search for desired information in the Web sites. As the
number of sites increases and the size of documents 
keeps growing, Web users face difficulties in searching 
and collecting information pieces that are usually dis- 
persed over a large number of Web pages. There are 
two notorious problems in manually searching for the 
Web information within a hyper space of Web: disorien- 
tation and cognitive overload. The former implies the 
loss of current navigation position in the hyper space; 
and the latter implies user's burden to know exact site 
information such as uniform resource locator (URL) and 
directories in order to access desired Web sites. 
[0003] In recent years, the convenience of the
document search function offered by search engines
such as Yahoo, Infoseek, AltaVista, etc., has been widely
recognized by Web users. Typically, the search engines
maintain index information drawn from the Web sites,
classify it in a form that suits users and provide the
classified index information (e.g., URLs, titles and a few
lines of excerpt) via Web browsers. By using this function
of the search engines, the URL index search space can
be narrowed. Moreover, the URL index search space
can be narrowed down further by other search
engines, e.g., WebCompass and Microsoft Index
Server, which employ sophisticated filtering and
statistical techniques.

[0004] In many cases, it may be preferred to
extract and collect several interesting pieces of
information into a single file for easy and quick reference.
In a conventional information extraction and collection
scheme, to extract a relatively small number of pieces of
information that are usually dispersed throughout a large
number of documents in the form of Web pages, a user
manually looks into many hyperlink-connected Web pages,
even after the user obtains a target URL with some
partial information (e.g., title, abstract, etc.) from a
search engine. The manual search and collection of even
relatively few and small pieces of data (e.g., sentences,
paragraphs) are obviously laborious tasks. They cost the
user a great deal of time and effort, and the results may
often be unsatisfactory, missing some unknown amount of
relevant information.
[0005] Most search engines that offer Web related
information, on the other hand, employ a robot agent
function. The robot agent function may be generally
classified into four major functions: mirroring, statistical
analysis, maintenance and resource discovery.
Specifically, the mirroring function takes a structure of a
specific site and fetches Web information from the specific
site; the statistical analysis function looks into the
number of hosts or servers; the maintenance function
finds and eliminates a dead link in the search engine,
etc.; and the resource discovery function automatically
discovers required resources in their sites through an
autonomy of the agent.

[0006] Known systems implemented based on these
functions include Letizia, citation finder (CIFI),
multi-owner maintenance spider (MOMspider), etc. In
particular, a typical agent developed for resource
discovery on a dynamic Web is WebCrawler. This
agent, however, is limited in that it has no intelligent
element to choose hyperlinks for automatic navigation.

[0007] Some other systems, such as world wide
web query system (W3QS) and World Wide Web-based
information retrieval and extraction (WIRE), aim for
intelligent searches with a superior performance over
the robot agent. However, these systems employ
relatively complicated query techniques to overcome the
problem of the overwhelming amount of search results from
the existing search engines. Specifically, the W3QS
requires user's input in the form of W3QL, a specially
designed structured query language; and the WIRE
requires user's input in the form of a query tree. The
complicated query forms requested by the systems can
be a serious burden to most users, who are familiar
with the keyword-based query in the Web.

[0008] It is, therefore, an object of the present
invention to provide a Web information extractor (WIE)
that offers ordinary Web users an easy way to query, an
automatic hyperlink space search and a quick collection of
desired contents with an intelligent information
extraction algorithm.

[0009] In accordance with a preferred embodiment
of the present invention, there is provided a method for
Web information extraction service in an intelligent
agent system having a client module and a server
module, the method comprising the steps of:

(a) driving and connecting a Web browser to the
server module;

(b) downloading, at the client module, a Java applet
from the server module to form a user interface
window;

(c) inputting, on the user interface window, a target
uniform resource locator (URL), a keyword and
preset depth first search (DFS) information and
sending them to the server module;

(d) searching, at the server module, Web sites
corresponding to the target URL and URL's of lower
layers linked thereto based on the URL and the
DFS information;

(e) extracting Web information corresponding to the
keyword from Web pages of the searched sites;

(f) processing and storing the extracted information
as a single user file;

(g) repeating the steps (d) to (f) until the lower
layer URL's corresponding to the DFS information
are reached; and

(h) sending the stored information to the client
module to enable the user to browse the information
using the Web browser.

[0010] In accordance with another preferred
embodiment of the present invention, there is provided
an apparatus for Web information extraction service in
an intelligent agent system comprising a client module
and a server module, wherein the client module
includes:

means for driving and connecting a Web browser
to the server module;

means for downloading a Java applet from the server
module to form a user interface window; and

means for inputting a target uniform resource
locator (URL), a keyword and preset depth first search
(DFS) information and sending them to the server
module;

and the server module includes:

means for searching Web sites corresponding to
the target URL and URL's of lower layers linked
thereto based on the URL and the DFS information;

means for extracting Web information corresponding
to the keyword from Web pages of the searched
sites;

means for processing and storing the extracted
information as a single user file; and

means for sending the stored information to the
client module when the lower layer URL's
corresponding to the DFS information are reached, to
enable the user to browse the information using the
Web browser.

[0011] The above and other objects and features of 
the present invention will become apparent from the fol- 
lowing description of preferred embodiments given in 
conjunction with the accompanying drawings, in which: 

Fig. 1 is a schematic diagram for explaining the 
overall concept of a novel WIE for extracting Web 
information in a hyper space in accordance with the 
present invention; 

Fig. 2 offers a detailed block diagram of the WIE in 
accordance with the invention; 
Fig. 3 presents a WIE user interface created 
through a Java applet downloaded in accordance 
with the invention; and 

Figs. 4A and 4B show diagrams illustrating a
hyperlink connection structure and an A-edge
eliminated tree structure, respectively.

[0012] With reference to Fig. 1, there is provided a
diagram illustrating the overall concept of a novel WIE
for extracting Web information in a hyper space 10 in
accordance with an embodiment of the present
invention. In Fig. 1, there are illustrated three Web pages
containing qualified pieces of information relevant to a
keyword given by a Web user, wherein solid arrows
indicate search paths while broken arrows indicate
disqualified links. The WIE of the present invention includes
three main functions: hyperlink traversal for finding
desired information pieces, searching and collecting
them into a user file. These functions are performed in
accordance with a target URL provided from a search
engine (not shown), a keyword and depth first search
(DFS) information, as depicted in Fig. 1. Consequently,
the WIE provides the user with an extracted result 20 in
the form of a single user file, which is obtained on the
basis of the functions. Details of such functions will be
provided with reference to Figs. 2 to 4 below.

[0013] The WIE of the present invention, which will
be fully explained later, employs four distinct features as
follows. Firstly, the WIE employs a simple
search-by-keyword operation, because keyword-based
Web search is the easiest and the most commonly used
method among Web users. Secondly, the WIE provides
a single user file to the user by collecting several
paragraphs, each containing a submitted keyword or
keyword predicate. Thus, the user can benefit from the
condensed data of his or her interest. Thirdly, the WIE
provides a convenient user interface implemented with
a Java applet running on an Internet Web browser. This
enables the user to easily take advantage of all Java
programming capabilities for better Internet operations.
Finally, the WIE refines the service of the existing
search engines rather than replacing it, by
using a target URL as a search term.
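
The second feature, collecting the paragraphs that contain the submitted keyword, can be illustrated with a minimal Python sketch. It is not the implementation of the invention: the names `ParagraphCollector` and `extract_paragraphs` are hypothetical, a paragraph is assumed to be the text of an HTML `<p>` element, and matching is assumed to be a case-insensitive substring test.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collects the text of each <p> element in an HTML page."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None  # None while the parser is outside a <p> element

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # an unclosed <p> ends where the next one starts
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def _flush(self):
        if self._buffer is not None:
            # Join the text fragments and normalize whitespace.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
        self._buffer = None


def extract_paragraphs(html, keyword):
    """Return the paragraphs of `html` that contain `keyword`."""
    collector = ParagraphCollector()
    collector.feed(html)
    collector.close()
    collector._flush()  # keep a trailing paragraph that was never closed
    needle = keyword.lower()
    return [p for p in collector.paragraphs if needle in p.lower()]
```

Paragraphs returned for each visited page could then be appended to one result file per query, which corresponds to the single user file mentioned above.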
[0014] There are two possible approaches in the
implementation of the WIE service over the Internet:
implementation at a client module or at a server module. In
a preferred embodiment of the invention, it is assumed
that the latter approach is chosen for the following two
reasons. First, by offering the WIE service at the server
module, Web users are spared from installing
additional software on their clients. Second, the WIE
can make use of more computing capability at the
server module in a way similar to other search engines.
The WIE can also be incorporated in an existing search
engine to provide an extended content extraction
service.

[0015] Turning now to Fig. 2, there is shown a high
level architecture of the WIE in accordance with the
invention. The WIE comprises a client module 100
running on a client computer (not shown) and a Java agent
(or server module) 200 running on a server computer
(not shown). The client module 100 allows the user to
submit queries for the hyperlink traversal and data
collection. The client module 100 requires a target URL
address and a simple keyword as its inputs. For this
implementation, there is used a downloadable Java
applet 140 running on a Web browser 120 so that the
client module 100 can make use of most Java
programming capabilities. A user interface (UI) window 160 in
the client module 100 serves to interface the Java
applet 140 to the server module 200 via a network 300.
In other words, the UI window 160 is generated through
the downloaded Java applet 140 to maintain a
connection with the Java agent implemented on the server
module 200 via a given socket or channel.
[0016] Referring to Fig. 3, there is depicted a
schematic diagram of the UI window 160 in accordance with
the present invention. In the UI window 160 of the
present invention, as shown, there are provided three
input fields 162, 164, 166, and a start button 168 for
starting a contents search navigation procedure in
accordance with the embodiment of the invention.
Among the input fields, the first field 162 is used to input
the target URL, the second field 164 is used to input a
keyword designating Web information the user wants,
and the third field 166 is utilized to input DFS
information. The target URL denotes one of the URL's to be
searched through the Web browser 120, and the DFS
information indicates the depth of layers of URL's linked
in a hierarchical structure. All the information inputted
through the input fields 162, 164, 166 is provided to
the server's Java agent 200 when the start
button 168 is pressed.

[0017] Referring back to Fig. 2, the server module
200 includes a fetch module (FM) 220, a link index table
222, a search module (SM) 240 and a collection module
(CM) 260. The FM 220 is connected to a Web server,
e.g., one of servers 310 and 320, corresponding to the
target URL, and fetches hypertext markup language
(HTML) pages from the Web server, wherein hyperlink
traversal and paragraph extraction operations are
performed based on the target URL, the keyword and the
DFS information. For multiuser operation, the FM 220 also
manages query identifications (IDs) to guarantee that
resulting files are directed to the corresponding query
issuers.

[0018] The SM 240 interacts with two submodules,
a URL checker 242 and a syntax analyzer 244, to
implement a hyperlink traversal algorithm, wherein details of
the hyperlink traversal algorithm will be fully provided
with reference to Figs. 4A and 4B later. The URL
checker 242 is used to exclude hyperlinks pointing to
irrelevant URLs that are either not in the HTML format
(e.g., ps, doc, and ppt) or not served over the HTTP
protocol (e.g., mail, ftp, and gopher). The syntax analyzer
244 is responsible for organizing the resulting document in
the HTML format so that the user can refer back to the
original document whenever desired. The analyzer 244 also
supports the semantic arrangement among information
pieces extracted from different Web pages. Meanwhile,
during the hyperlink traversal by the SM 240, the CM
260 finds the related information pieces out of Web
pages that have been fetched by the FM 220, and then
collects them for later delivery to the user. The FM 220
and the CM 260 are recursively operated during the
hyperlink traversal by the SM 240.
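
The filtering rule of the URL checker 242 can be sketched as follows. This is an illustrative Python sketch only: the function name `is_traversable` and the two constant sets are hypothetical, the extension list contains just the three formats named above, and accepting `https` in addition to `http` is an assumption that goes beyond the text.

```python
import posixpath
from urllib.parse import urlparse

# Formats named in the description as not being HTML.
NON_HTML_EXTENSIONS = {".ps", ".doc", ".ppt"}
# "https" is an assumption; the description mentions only HTTP.
ALLOWED_SCHEMES = {"http", "https"}


def is_traversable(url):
    """Return True if `url` looks like an HTML page reachable over HTTP."""
    parts = urlparse(url)
    if parts.scheme.lower() not in ALLOWED_SCHEMES:
        return False  # e.g. mailto:, ftp:, gopher:
    extension = posixpath.splitext(parts.path)[1].lower()
    return extension not in NON_HTML_EXTENSIONS
```

Relative links carry no scheme and are rejected by this check, so a caller would first resolve them against the page address, for example with `urllib.parse.urljoin`.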
[0019] Finally, the implementation of the invention
supports multithread concurrency for multiuser
access to the WIE. With this
configuration, the WIE provides client-transparent
connection and execution serving multiple users. Each
thread plays a proxy role for a client's request. When the
WIE receives a request, a thread (or proxy) is created
and a connection between the thread and the client is
established. Once the connection is established, the
WIE continues to maintain the connection until the
result of the request is provided to the client.
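
The thread-per-request arrangement described above can be sketched in Python as follows. The sketch replaces the network connection with an in-process queue, and `handle_request` and `serve` are hypothetical names; the body of `handle_request` is a placeholder for the traversal and extraction work, not the actual agent logic.

```python
import queue
import threading


def handle_request(query_id, keyword, results):
    """Proxy for one client: do the work, then hand back the result."""
    # Placeholder for the traversal and extraction done for this query.
    results.put((query_id, f"results for {keyword!r}"))


def serve(requests):
    """Start one thread per request and gather results by query ID."""
    results = queue.Queue()
    workers = []
    for query_id, keyword in enumerate(requests, start=1):
        worker = threading.Thread(
            target=handle_request, args=(query_id, keyword, results)
        )
        worker.start()
        workers.append(worker)
    for worker in workers:
        worker.join()  # each proxy lives until its result is delivered
    collected = {}
    while not results.empty():
        query_id, payload = results.get()
        collected[query_id] = payload
    return collected
```

Keying the results by query ID mirrors the role of the query identifications managed by the FM 220, which keep each result associated with the client that issued the request.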

[0020] Referring to Fig. 4A, the edges A1, A2 and A3
point to nodes 1, 4 and 3, respectively, and thus do
not conform to a hierarchical tree structure. Hyperlink
connections form a hierarchical tree for a document with
many irregular edges that do not conform to the
hierarchical tree structure. The irregular edges are called
anti-hierarchical edges, or simply A-edges. When these
edges are removed, the hyperlink connection turns into
a tree structure as shown in Fig. 4B so that a
conventional DFS traversal algorithm can be applied.
Consequently, the present invention is capable of improving
the DFS traversal algorithm by detecting and eliminating
the A-edges. During the link traversal by the FM 220
and the SM 240, paragraph collection is carried out by
the CM 260.
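
The effect of removing A-edges before a depth first traversal can be illustrated with a short Python sketch. It is a simplification under stated assumptions: pages are modeled as an in-memory dictionary instead of being fetched, the `visited` set stands in for the link index table 222, the DFS information is taken to be a maximum depth, and the function name `traverse` and the example graph are hypothetical (the graph is not the one of Fig. 4A).

```python
def traverse(links, root, max_depth):
    """Depth first traversal that skips links to already visited nodes.

    links: dict mapping a node to the list of nodes it links to.
    Returns (visit_order, skipped_edges).
    """
    visited = set()  # stands in for the link index table
    order = []
    skipped = []

    def visit(node, depth):
        visited.add(node)
        order.append(node)
        if depth == max_depth:
            return  # depth limit reached; do not follow further links
        for child in links.get(node, []):
            if child in visited:
                skipped.append((node, child))  # treated as an A-edge
                continue
            visit(child, depth + 1)

    visit(root, 0)
    return order, skipped


# Hypothetical link structure with three back or cross links.
example = {
    "root": ["a", "b"],
    "a": ["a1", "root"],
    "a1": ["a"],
    "b": ["a"],
}
order, skipped = traverse(example, "root", max_depth=3)
```

In this example every node is visited once in hierarchical order, and the links that point back to an already visited node are skipped, leaving a tree.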

[0021] For the hyperlink traversal, the DFS traversal
algorithm allows the WIE service to maintain the
hierarchical semantics of the original document (i.e.,
chapters, sections and subsections) in the extracted results.
In the implementation of the invention, the A-edge
elimination algorithm is carried out by using a link index table
(LIT) 222, as shown in Fig. 2, which records nodes that
have been visited during the hyperlink traversal. Before
traversing a node, the LIT 222 is looked up to determine
whether or not a corresponding link was previously visited.
The A-edge elimination algorithm of the invention is
summarized as follows:

1. Given a query containing a target URL and a
keyword (or keyword predicate), the FM 220 assigns a
query ID.

2. The target URL becomes a root node for the
subsequent page traversal.

3. When the SM 240 examines the content of a page
using the keyword as a search term, there are two
possible circumstances.

3.1 When the keyword is found in plain text
during the search, the CM 260 extracts the
corresponding paragraph (as a container) under the
query ID.

3.2 When the keyword is found in a hyperlink
address or description, the SM 240 records the
address in the LIT 222 for later elimination of
A-edges. Thus, all the traversed index links will be
recorded in the LIT 222.

4. Whenever the SM 240 detects a hyperlink, it
looks up the LIT 222 to determine whether the
hyperlink is an A-edge or not. If the address
of the hyperlink is in the LIT 222, it is regarded as an
A-edge, so the SM 240 skips visiting the
corresponding node. When a link is identified as a
qualified one to traverse, the FM 220 fetches the
corresponding page and the above item 3 is repeated.

5. When the recursive traversals for all child nodes
are completed, the A-edge elimination algorithm is
terminated. Finally, the CM 260 notifies the query
issuer of the result of the whole process in terms of a
URL address. Thus, the user can view the
extracted information on the Web browser using the
URL address provided from the WIE.

[0022] By adding further functionalities to the WIE
service, the present invention allows the user to specify
any traversal depth as a partial input in the query. By
doing so, the user can receive a quick response for
some limited amount of return whenever desired.
However, those skilled in the art will recognize that,
in general, the traversal depth for most Web documents
is negligibly small.

[0023] As can be seen from the above, the content
extraction algorithm of the present invention provides a
significant advantage over the prior art manual
extraction on several benchmark documents of different sizes.
For instance, if the size of the document is small, e.g., if it is
smaller than 2^4 Kbytes, there will not be a big difference
between the two approaches. If the document size is large, e.g.,
if it is larger than 2^13 Kbytes, however, the WIE
approach of the invention outperforms the other by a
ratio of 22 (note that when many pages are involved
more manual operations are required). From the above,
it is concluded that the efficiency of the WIE improves
significantly as the document size increases.

[0024] While the invention has been shown and
described with respect to the preferred embodiments, it
will be understood by those skilled in the art that various
changes and modifications may be made without
departing from the scope of the invention as defined in
the following claims. The features disclosed in the
foregoing description, in the claims and/or in the
accompanying drawings may, both separately and in any
combination thereof, be material for realising the
invention in diverse forms thereof.

Claims

1. A method for Web information extraction service in
an intelligent agent system having a client module
and a server module, the method comprising the
steps of:

(a) driving and connecting a Web browser
contained in the client module to the server
module;

(b) downloading, at the client module, a Java
applet from the server module to form a user
interface window;

(c) inputting, on the user interface window, a
target uniform resource locator (URL), a
keyword and preset depth first search (DFS)
information and sending them to the server module;

(d) searching, at the server module, Web sites
corresponding to the target URL and URL's of
lower layers linked thereto based on the DFS
information;

(e) extracting Web information corresponding
to the keyword from Web pages of the
searched sites and processing and storing the
extracted information as a single user file;

(f) repeating the steps (d) to (e) until all lower
layer URL's corresponding to the DFS
information are processed; and

(g) sending the stored information to the client
module to enable the user to browse the
information using the Web browser.

2. The method of claim 1, wherein the server module
is a Java agent system based on a Java language.

3. The method of claim 1, wherein the step (d)
includes the steps of:

(d1) hierarchically searching the Web sites of
the target URL and the lower layer's URL's by
using the DFS information;

(d2) determining whether or not each searched
site's information exists within a set of
previously searched site's information stored in a
look up table; and

(d3) if it is determined at the step (d2) that said
each searched site's information is within the
set of previously searched site's information,
excluding said each site's information.

4. The method of claim 2, wherein the step (d) further
includes the steps of:

(d4) checking whether or not each of the
searched sites is based on a preset
communications protocol, and if not, excluding said each
site.

5. The method of claim 4, wherein the preset
communications protocol is a hypertext transfer protocol.

6. The method of claim 1, wherein the user interface
window is connected to the server module via a
socket which is independent of the Web browser.

7. An apparatus for Web information extraction
service in an intelligent agent system comprising a
client module and a server module,
wherein the client module includes:

means for driving and connecting a Web
browser contained in the client module to the
server module;

means for downloading a Java applet from the
server module to form a user interface window;
and

means for inputting a target uniform resource
locator (URL), a keyword and preset depth first
search (DFS) information and sending them to
the server module;

and the server module includes:

means for searching Web sites corresponding
to the target URL and URL's of lower layers
linked thereto based on the DFS information;

means for extracting Web information
corresponding to the keyword from Web pages of
the searched sites and processing and storing
the extracted information as a single user file;
and

means for sending the stored information to the
client module when all lower layer URL's
corresponding to the DFS information are
processed, to enable the user to browse the
information using the Web browser.

8. The apparatus of claim 7, wherein the server
module further includes means for storing the Web
information.

9. The apparatus of claim 7, wherein the inputting
means is operated by using three input fields
prepared on the user interface window.